Telemetry Data for CI Clusters
Every cluster running an OpenShift CI job sends some operational data back to Red Hat via Telemetry. This data gets stored as Prometheus metrics in a Thanos deployment at Red Hat. Examples of the Prometheus metrics collected include CPU and memory capacity, operators installed, alerts fired, and provider platform. Thus, in addition to high-level test run data on TestGrid and Prow, we also have detailed time series data available for the CI clusters that ran the tests.
In this notebook, we will show how to access this telemetry data using some open source tools developed by the AIOps team. Specifically, we will show how, given a CI job run, to get the telemetry data associated with the cluster that ran it.
NOTE: Since this data is currently hosted on a Red Hat internal Thanos, only those users with access to it will be able to run this notebook to get "live" data. To ensure that the wider open source community is also able to use this data for further analysis, we will use this notebook to extract a snippet of this data and save it on our public GitHub repo.
# import all the required libraries
import os
import warnings
import datetime as dt
from tqdm.notebook import tqdm
from IPython.display import display
from dotenv import load_dotenv, find_dotenv
from urllib3.exceptions import InsecureRequestWarning
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from prometheus_api_client import (
    PrometheusConnect,
    MetricSnapshotDataFrame,
    MetricRangeDataFrame,
)
import sys
sys.path.insert(1, "../TestGrid/metrics")
from ipynb.fs.defs.metric_template import save_to_disk # noqa: E402
load_dotenv(find_dotenv())
True

# config for a pretty notebook
sns.set()
load_dotenv(find_dotenv())
warnings.filterwarnings("ignore", category=InsecureRequestWarning)

Data Access Setup
In this section, we will configure the prometheus-api-client-python tool to pull data from our Thanos instance. That is, we set the value of PROM_URL to the Thanos endpoint, and the value of PROM_ACCESS_TOKEN to the bearer token for authentication. We will also set the timestamp at which the Prometheus queries are evaluated.
In order to get access to the token, you can follow either one of these steps:
1. Visit https://datahub.psi.redhat.com/. Click on your profile (top right) and select Copy Login Command from the drop down menu. This will copy a command that will look something like: oc login https://datahub.psi.redhat.com:443 --token=<YOUR_TOKEN>. The value in YOUR_TOKEN is the required token.
2. From the command line, run oc whoami --show-token, which prints the required token. Before doing so, ensure that the output of oc project refers to https://datahub.psi.redhat.com/.
NOTE: The above methods can only be used if you are on the Red Hat VPN.
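Before connecting, it can help to check that both settings are actually present. The following is a minimal sketch and not part of the original notebook: the helper name build_auth_headers is hypothetical, and it assumes the two values are exposed through the environment variables PROM_URL and PROM_ACCESS_TOKEN, as in the cell below.

```python
import os


def build_auth_headers(env=None):
    """Return (url, headers) for the Thanos connection, or raise if a setting is missing.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    url = env.get("PROM_URL")
    token = env.get("PROM_ACCESS_TOKEN")

    # collect the names of any settings that are unset or empty
    missing = [
        name
        for name, value in (("PROM_URL", url), ("PROM_ACCESS_TOKEN", token))
        if not value
    ]
    if missing:
        raise RuntimeError(f"Missing required setting(s): {', '.join(missing)}")

    # same header shape that is passed to PrometheusConnect in this notebook
    return url, {"Authorization": f"bearer {token}"}
```

Failing early with a named setting is usually easier to debug than an authentication error returned by the server.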
# prometheus from which metrics are to be fetched
PROM_URL = os.getenv("PROM_URL")
PROM_ACCESS_TOKEN = os.getenv("PROM_ACCESS_TOKEN")

# prometheus connector object
pc = PrometheusConnect(
    url=PROM_URL,
    disable_ssl=True,
    headers={"Authorization": f"bearer {PROM_ACCESS_TOKEN}"},
)

# timestamp for which prometheus queries will be evaluated
query_eval_time = dt.datetime.now(tz=dt.timezone.utc) - dt.timedelta(hours=6)
query_eval_ts = query_eval_time.timestamp()

# which metrics to fetch
# we will try to get all metrics, but leave out ones that may have potentially sensitive data
metrics_to_fetch = [
    m
    for m in pc.all_metrics()
    if "subscription" not in m and "internal" not in m and "url" not in m
]

# these fields are either irrelevant or contain something that could potentially be sensitive
# either way, these likely won't be useful for analysis anyway so exclude them when reading data
drop_cols = [
    "prometheus",
    "tenant_id",
    "endpoint",
    "instance",
    "receive",
    "url",
]

Get All Data for a Given Job Build
In this section, we will get all the Prometheus metrics corresponding to a given job name and build ID. The job name and build ID can be obtained either directly from the TestGrid UI, or from the query and changelists fields respectively in the TestGrid JSON, as shown in the TestGrid metadata EDA notebook.
One of the metrics stored in Thanos is cluster_installer. This metric describes what entity triggered the install of each cluster. For the clusters that run OpenShift CI jobs, the invoker label value in this metric is set to openshift-internal-ci/{job_name}/{build_id}.
Therefore, we can get all data for a given job build by first finding the ID of the cluster that ran it (using cluster_installer), and then querying prometheus for metrics where the _id label value equals this cluster ID. These steps are demonstrated through the example below.
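The two steps described above can be sketched as a pair of small helpers. This is an illustrative sketch rather than code from the notebook: the function names installer_query and extract_cluster_id are hypothetical, and the result is assumed to be the list of series that an instant query returns, where each element carries its labels under the "metric" key.

```python
def installer_query(job_name, build_id):
    """Build the PromQL selector for the cluster_installer series of one job build."""
    invoker = f"openshift-internal-ci/{job_name}/{build_id}"
    return f'cluster_installer{{invoker="{invoker}"}}'


def extract_cluster_id(result):
    """Return the single cluster ID found in the `_id` label of the query result."""
    if not result:
        # nothing matched, e.g. no data at the chosen evaluation time
        raise LookupError("no cluster_installer series found for this job build")
    cluster_ids = {series["metric"]["_id"] for series in result}
    if len(cluster_ids) > 1:
        raise LookupError(f"expected one cluster ID, found {sorted(cluster_ids)}")
    return cluster_ids.pop()
```

Unlike indexing the first element directly, this variant raises a descriptive error when the query returns no series or more than one cluster ID.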
# example job and build
job_name = "periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-ovn-upgrade"
build_id = "1380452039472975872"

# get installer info for the job/build
job_build_cluster_installer = pc.custom_query(
    query=f'cluster_installer{{invoker="openshift-internal-ci/{job_name}/{build_id}"}}',
    params={"time": query_eval_ts},
)
# extract cluster id out of the installer info metric
cluster_id = job_build_cluster_installer[0]["metric"]["_id"]

Get One Metric
Before we fetch all the metrics, let's fetch just one metric to familiarize ourselves with the data format and understand how to interpret it. In the cell below, we will look at an example metric, cluster:capacity_cpu_cores:sum.
# fetch the metric and format it into a df
metric_df = MetricSnapshotDataFrame(
    pc.custom_query(
        query=f'cluster:capacity_cpu_cores:sum{{_id="{cluster_id}"}}',
        params={"time": query_eval_ts},
    )
)
# drop irrelevant data
metric_df.drop(columns=drop_cols, errors="ignore", inplace=True)
metric_df

| | __name__ | _id | label_beta_kubernetes_io_instance_type | label_kubernetes_io_arch | label_node_openshift_io_os_id | timestamp | value | label_node_role_kubernetes_io |
|---|---|---|---|---|---|---|---|---|
| 0 | cluster:capacity_cpu_cores:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | m4.xlarge | amd64 | rhcos | 1.617966e+09 | 12 | NaN |
| 1 | cluster:capacity_cpu_cores:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | m5.xlarge | amd64 | rhcos | 1.617966e+09 | 12 | master |
HOW TO READ THIS DATAFRAME
In the above dataframe, each column represents a "label" of the Prometheus metric, and each row represents a different "label configuration". In this example, the first row has label_node_role_kubernetes_io = NaN and value = 12, and the second row has label_node_role_kubernetes_io = master and value = 12. Since this metric is a sum, this means that in this cluster the master nodes had 12 CPU cores in total, and the nodes without a role label (the worker nodes) also had 12 CPU cores in total.
To learn more about labels, label configurations, and the Prometheus data model in general, please see the official Prometheus documentation.
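To make the idea of a "label configuration" concrete, here is a small sketch that works on the raw query result instead of the dataframe. The raw_result list is a hand-written stand-in that mirrors the two rows shown above (the timestamp is a rounded placeholder), and the helper cores_by_role is hypothetical. Treating a missing role label as "worker" follows the reading given above and is an assumption, not something the metric itself states.

```python
from collections import defaultdict

ROLE_LABEL = "label_node_role_kubernetes_io"

# Stand-in for an instant-query result: one element per label configuration.
# Each element holds its labels under "metric" and a [timestamp, value] pair
# under "value", with the value encoded as a string.
raw_result = [
    {
        "metric": {
            "__name__": "cluster:capacity_cpu_cores:sum",
            "_id": "a4c5e284-11dd-4b9c-af67-a4776665f9df",
            "label_beta_kubernetes_io_instance_type": "m4.xlarge",
        },
        "value": [1617966000, "12"],
    },
    {
        "metric": {
            "__name__": "cluster:capacity_cpu_cores:sum",
            "_id": "a4c5e284-11dd-4b9c-af67-a4776665f9df",
            "label_beta_kubernetes_io_instance_type": "m5.xlarge",
            ROLE_LABEL: "master",
        },
        "value": [1617966000, "12"],
    },
]


def cores_by_role(result, missing_role="worker"):
    """Sum the sample values per node role; a series without a role label
    is counted under `missing_role` (an assumption, see the text above)."""
    totals = defaultdict(float)
    for series in result:
        role = series["metric"].get(ROLE_LABEL, missing_role)
        _timestamp, value = series["value"]
        totals[role] += float(value)
    return dict(totals)


print(cores_by_role(raw_result))  # {'worker': 12.0, 'master': 12.0}
```

A label that is absent from a series shows up as NaN in the dataframe, which is why the first row above has no role.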
Get All Metrics
Now that we understand the data structure of the metrics, let's fetch all the metrics and concatenate them into one single dataframe.
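Because different metrics carry different labels, the combined table ends up with the union of all label columns, and a row has no value in a column that its metric does not define. The sketch below is a plain-Python analogue of that behavior, written only for illustration: combine_rows is a hypothetical helper, the rows are simplified excerpts of the outputs in this notebook, and missing entries are filled with None, whereas pandas would show NaN.

```python
def combine_rows(tables):
    """Stack rows from several tables, using the union of their columns.

    `tables` is a list of tables, each a list of dict rows. Columns keep the
    order in which they are first seen; missing cells are filled with None.
    """
    columns = []
    for rows in tables:
        for row in rows:
            for key in row:
                if key not in columns:
                    columns.append(key)
    combined = [
        {col: row.get(col) for col in columns}
        for rows in tables
        for row in rows
    ]
    return columns, combined


alerts_rows = [
    {"__name__": "alerts", "alertname": "Watchdog", "value": "1"},
]
cpu_rows = [
    {"__name__": "cluster:cpu_usage_cores:sum", "value": "8.916476190476182"},
]

columns, combined = combine_rows([alerts_rows, cpu_rows])
print(columns)
for row in combined:
    print(row)
```

The CPU usage row has no alertname, so that cell is empty in the combined result. This is the reason the final dataframe is wide and sparse.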
# let's combine all the metrics into one dataframe
# for the above mentioned job name and build id.
all_metrics_df = pd.DataFrame()
for metric in metrics_to_fetch:
    # fetch metric for the cluster
    metric_df = MetricSnapshotDataFrame(
        pc.custom_query(
            query=f'{metric}{{_id="{cluster_id}"}}',
            params={"time": query_eval_ts},
        )
    )
    if len(metric_df) > 0:
        # drop irrelevant cols, if any
        metric_df.drop(columns=drop_cols, errors="ignore", inplace=True)
        # show a glimpse of data
        print(f"Metric = {metric}")
        display(metric_df.head())
        # combine all the metrics data.
        all_metrics_df = pd.concat(
            [
                all_metrics_df,
                metric_df,
            ],
            axis=0,
            join="outer",
            ignore_index=True,
        )

Metric = alerts
| | __name__ | _id | alertname | alertstate | severity | timestamp | value |
|---|---|---|---|---|---|---|---|
| 0 | alerts | a4c5e284-11dd-4b9c-af67-a4776665f9df | AlertmanagerReceiversNotConfigured | firing | warning | 1.617966e+09 | 1 |
| 1 | alerts | a4c5e284-11dd-4b9c-af67-a4776665f9df | Watchdog | firing | none | 1.617966e+09 | 1 |
Metric = cco_credentials_mode
| | __name__ | _id | container | job | mode | namespace | pod | service | timestamp | value |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cco_credentials_mode | a4c5e284-11dd-4b9c-af67-a4776665f9df | kube-rbac-proxy | cco-metrics | mint | openshift-cloud-credential-operator | cloud-credential-operator-578dd486f4-mnb2j | cco-metrics | 1.617966e+09 | 1 |
Metric = cluster:apiserver_current_inflight_requests:sum:max_over_time:2m
| | __name__ | _id | apiserver | timestamp | value |
|---|---|---|---|---|---|
| 0 | cluster:apiserver_current_inflight_requests:su... | a4c5e284-11dd-4b9c-af67-a4776665f9df | kube-apiserver | 1.617966e+09 | 18 |
| 1 | cluster:apiserver_current_inflight_requests:su... | a4c5e284-11dd-4b9c-af67-a4776665f9df | openshift-apiserver | 1.617966e+09 | 3 |
Metric = cluster:capacity_cpu_cores:sum
| | __name__ | _id | label_beta_kubernetes_io_instance_type | label_kubernetes_io_arch | label_node_openshift_io_os_id | timestamp | value | label_node_role_kubernetes_io |
|---|---|---|---|---|---|---|---|---|
| 0 | cluster:capacity_cpu_cores:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | m4.xlarge | amd64 | rhcos | 1.617966e+09 | 12 | NaN |
| 1 | cluster:capacity_cpu_cores:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | m5.xlarge | amd64 | rhcos | 1.617966e+09 | 12 | master |
Metric = cluster:capacity_memory_bytes:sum
| | __name__ | _id | label_beta_kubernetes_io_instance_type | timestamp | value | label_node_role_kubernetes_io |
|---|---|---|---|---|---|---|
| 0 | cluster:capacity_memory_bytes:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | m4.xlarge | 1.617966e+09 | 50432839680 | NaN |
| 1 | cluster:capacity_memory_bytes:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | m5.xlarge | 1.617966e+09 | 49156497408 | master |
Metric = cluster:cpu_usage_cores:sum
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:cpu_usage_cores:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 8.916476190476182 |
Metric = cluster:kube_persistentvolume_plugin_type_counts:sum
| | __name__ | _id | plugin_name | volume_mode | timestamp | value |
|---|---|---|---|---|---|---|
| 0 | cluster:kube_persistentvolume_plugin_type_coun... | a4c5e284-11dd-4b9c-af67-a4776665f9df | kubernetes.io/aws-ebs | Filesystem | 1.617966e+09 | 2 |
Metric = cluster:kube_persistentvolumeclaim_resource_requests_storage_bytes:provisioner:sum
| | __name__ | _id | provisioner | timestamp | value |
|---|---|---|---|---|---|
| 0 | cluster:kube_persistentvolumeclaim_resource_re... | a4c5e284-11dd-4b9c-af67-a4776665f9df | kubernetes.io/aws-ebs | 1.617966e+09 | 21474836480 |
Metric = cluster:kubelet_volume_stats_used_bytes:provisioner:sum
| | __name__ | _id | provisioner | timestamp | value |
|---|---|---|---|---|---|
| 0 | cluster:kubelet_volume_stats_used_bytes:provis... | a4c5e284-11dd-4b9c-af67-a4776665f9df | kubernetes.io/aws-ebs | 1.617966e+09 | 525832192 |
Metric = cluster:memory_usage_bytes:sum
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:memory_usage_bytes:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 35710898176 |
Metric = cluster:network_attachment_definition_enabled_instance_up:max
| | __name__ | _id | networks | timestamp | value |
|---|---|---|---|---|---|
| 0 | cluster:network_attachment_definition_enabled_... | a4c5e284-11dd-4b9c-af67-a4776665f9df | any | 1.617966e+09 | 0 |
| 1 | cluster:network_attachment_definition_enabled_... | a4c5e284-11dd-4b9c-af67-a4776665f9df | ib-sriov | 1.617966e+09 | 0 |
| 2 | cluster:network_attachment_definition_enabled_... | a4c5e284-11dd-4b9c-af67-a4776665f9df | sriov | 1.617966e+09 | 0 |
Metric = cluster:network_attachment_definition_instances:max
| | __name__ | _id | networks | timestamp | value |
|---|---|---|---|---|---|
| 0 | cluster:network_attachment_definition_instance... | a4c5e284-11dd-4b9c-af67-a4776665f9df | any | 1.617966e+09 | 0 |
| 1 | cluster:network_attachment_definition_instance... | a4c5e284-11dd-4b9c-af67-a4776665f9df | ib-sriov | 1.617966e+09 | 0 |
| 2 | cluster:network_attachment_definition_instance... | a4c5e284-11dd-4b9c-af67-a4776665f9df | sriov | 1.617966e+09 | 0 |
Metric = cluster:node_instance_type_count:sum
| | __name__ | _id | label_beta_kubernetes_io_instance_type | label_kubernetes_io_arch | label_node_openshift_io_os_id | timestamp | value | label_node_role_kubernetes_io |
|---|---|---|---|---|---|---|---|---|
| 0 | cluster:node_instance_type_count:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | m4.xlarge | amd64 | rhcos | 1.617966e+09 | 6 | NaN |
| 1 | cluster:node_instance_type_count:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | m5.xlarge | amd64 | rhcos | 1.617966e+09 | 3 | master |
Metric = cluster:telemetry_selected_series:count
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:telemetry_selected_series:count | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 453 |
Metric = cluster:usage:containers:sum
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:usage:containers:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 1027 |
Metric = cluster:usage:ingress_frontend_bytes_in:rate5m:sum
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:usage:ingress_frontend_bytes_in:rate5m... | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 2589.339455740741 |
Metric = cluster:usage:ingress_frontend_bytes_out:rate5m:sum
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:usage:ingress_frontend_bytes_out:rate5... | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 21394.726972761906 |
Metric = cluster:usage:ingress_frontend_connections:sum
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:usage:ingress_frontend_connections:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 4 |
Metric = cluster:usage:kube_node_ready:avg5m
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:usage:kube_node_ready:avg5m | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 1 |
Metric = cluster:usage:kube_schedulable_node_ready_reachable:avg5m
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:usage:kube_schedulable_node_ready_reac... | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 1 |
Metric = cluster:usage:openshift:ingress_request_error:fraction5m
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:usage:openshift:ingress_request_error:... | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 0 |
Metric = cluster:usage:openshift:ingress_request_total:irate5m
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:usage:openshift:ingress_request_total:... | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 3.633333333333333 |
Metric = cluster:usage:openshift:kube_running_pod_ready:avg
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:usage:openshift:kube_running_pod_ready... | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 0.9764150943396227 |
Metric = cluster:usage:pods:terminal:workload:sum
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:usage:pods:terminal:workload:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 3 |
Metric = cluster:usage:resources:sum
| | __name__ | _id | resource | timestamp | value |
|---|---|---|---|---|---|
| 0 | cluster:usage:resources:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | alertmanagerconfigs.monitoring.coreos.com | 1.617966e+09 | 0 |
| 1 | cluster:usage:resources:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | alertmanagers.monitoring.coreos.com | 1.617966e+09 | 1 |
| 2 | cluster:usage:resources:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | apiservers.config.openshift.io | 1.617966e+09 | 1 |
| 3 | cluster:usage:resources:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | apiservices.apiregistration.k8s.io | 1.617966e+09 | 79 |
| 4 | cluster:usage:resources:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | authentications.config.openshift.io | 1.617966e+09 | 1 |
Metric = cluster:usage:workload:capacity_physical_cpu_core_seconds
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:usage:workload:capacity_physical_cpu_c... | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 5742 |
Metric = cluster:usage:workload:capacity_physical_cpu_cores:max:5m
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:usage:workload:capacity_physical_cpu_c... | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 6 |
Metric = cluster:usage:workload:capacity_physical_cpu_cores:min:5m
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:usage:workload:capacity_physical_cpu_c... | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 6 |
Metric = cluster:usage:workload:ingress_request_error:fraction5m
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:usage:workload:ingress_request_error:f... | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 0 |
Metric = cluster:usage:workload:ingress_request_total:irate5m
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:usage:workload:ingress_request_total:i... | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 0 |
Metric = cluster:usage:workload:kube_running_pod_ready:avg
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | cluster:usage:workload:kube_running_pod_ready:avg | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 1 |
Metric = cluster:virt_platform_nodes:sum
| | __name__ | _id | type | timestamp | value |
|---|---|---|---|---|---|
| 0 | cluster:virt_platform_nodes:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | aws | 1.617966e+09 | 6 |
| 1 | cluster:virt_platform_nodes:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | kvm | 1.617966e+09 | 3 |
| 2 | cluster:virt_platform_nodes:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | xen | 1.617966e+09 | 3 |
| 3 | cluster:virt_platform_nodes:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | xen-hvm | 1.617966e+09 | 3 |
Metric = cluster_feature_set
| | __name__ | _id | container | job | namespace | pod | service | timestamp | value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | cluster_feature_set | a4c5e284-11dd-4b9c-af67-a4776665f9df | kube-apiserver-operator | metrics | openshift-kube-apiserver-operator | kube-apiserver-operator-568c9bd46-vc2m6 | metrics | 1.617966e+09 | 1 |
Metric = cluster_infrastructure_provider
| | __name__ | _id | container | job | namespace | pod | region | service | type | timestamp | value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cluster_infrastructure_provider | a4c5e284-11dd-4b9c-af67-a4776665f9df | kube-apiserver-operator | metrics | openshift-kube-apiserver-operator | kube-apiserver-operator-568c9bd46-vc2m6 | us-east-1 | metrics | AWS | 1.617966e+09 | 1 |
Metric = cluster_installer
| | __name__ | _id | invoker | job | namespace | pod | service | type | version | timestamp | value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cluster_installer | a4c5e284-11dd-4b9c-af67-a4776665f9df | openshift-internal-ci/periodic-ci-openshift-re... | cluster-version-operator | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | cluster-version-operator | openshift-install | v4.6.0 | 1.617966e+09 | 1 |
Metric = cluster_legacy_scheduler_policy
| | __name__ | _id | job | namespace | pod | service | timestamp | value |
|---|---|---|---|---|---|---|---|---|
| 0 | cluster_legacy_scheduler_policy | a4c5e284-11dd-4b9c-af67-a4776665f9df | metrics | openshift-kube-scheduler-operator | openshift-kube-scheduler-operator-69d8d7c996-g... | metrics | 1.617966e+09 | 0 |
Metric = cluster_master_schedulable
| | __name__ | _id | job | namespace | pod | service | timestamp | value |
|---|---|---|---|---|---|---|---|---|
| 0 | cluster_master_schedulable | a4c5e284-11dd-4b9c-af67-a4776665f9df | metrics | openshift-kube-scheduler-operator | openshift-kube-scheduler-operator-69d8d7c996-g... | metrics | 1.617966e+09 | 0 |
Metric = cluster_operator_conditions
| | __name__ | _id | condition | job | name | namespace | pod | reason | service | timestamp | value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cluster_operator_conditions | a4c5e284-11dd-4b9c-af67-a4776665f9df | Available | cluster-version-operator | authentication | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | AsExpected | cluster-version-operator | 1.617966e+09 | 1 |
| 1 | cluster_operator_conditions | a4c5e284-11dd-4b9c-af67-a4776665f9df | Available | cluster-version-operator | baremetal | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | AsExpected | cluster-version-operator | 1.617966e+09 | 1 |
| 2 | cluster_operator_conditions | a4c5e284-11dd-4b9c-af67-a4776665f9df | Available | cluster-version-operator | cloud-credential | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | NaN | cluster-version-operator | 1.617966e+09 | 1 |
| 3 | cluster_operator_conditions | a4c5e284-11dd-4b9c-af67-a4776665f9df | Available | cluster-version-operator | cluster-autoscaler | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | AsExpected | cluster-version-operator | 1.617966e+09 | 1 |
| 4 | cluster_operator_conditions | a4c5e284-11dd-4b9c-af67-a4776665f9df | Available | cluster-version-operator | config-operator | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | AsExpected | cluster-version-operator | 1.617966e+09 | 1 |
Metric = cluster_operator_up
| | __name__ | _id | job | name | namespace | pod | service | version | timestamp | value |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cluster_operator_up | a4c5e284-11dd-4b9c-af67-a4776665f9df | cluster-version-operator | authentication | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | cluster-version-operator | 4.7.0-0.ci-2021-04-07-120112 | 1.617966e+09 | 1 |
| 1 | cluster_operator_up | a4c5e284-11dd-4b9c-af67-a4776665f9df | cluster-version-operator | baremetal | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | cluster-version-operator | 4.7.0-0.ci-2021-04-07-120112 | 1.617966e+09 | 1 |
| 2 | cluster_operator_up | a4c5e284-11dd-4b9c-af67-a4776665f9df | cluster-version-operator | cloud-credential | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | cluster-version-operator | 4.7.0-0.ci-2021-04-07-120112 | 1.617966e+09 | 1 |
| 3 | cluster_operator_up | a4c5e284-11dd-4b9c-af67-a4776665f9df | cluster-version-operator | cluster-autoscaler | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | cluster-version-operator | 4.7.0-0.ci-2021-04-07-120112 | 1.617966e+09 | 1 |
| 4 | cluster_operator_up | a4c5e284-11dd-4b9c-af67-a4776665f9df | cluster-version-operator | config-operator | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | cluster-version-operator | 4.7.0-0.ci-2021-04-07-120112 | 1.617966e+09 | 1 |
Metric = cluster_version
| | __name__ | _id | from_version | image | job | namespace | pod | service | type | version | timestamp | value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cluster_version | a4c5e284-11dd-4b9c-af67-a4776665f9df | 4.6.23 | registry.build02.ci.openshift.org/ci-op-lnd3sj... | cluster-version-operator | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | cluster-version-operator | cluster | 4.7.0-0.ci-2021-04-07-120112 | 1.617966e+09 | 1617960923 |
| 1 | cluster_version | a4c5e284-11dd-4b9c-af67-a4776665f9df | 4.6.23 | registry.build02.ci.openshift.org/ci-op-lnd3sj... | cluster-version-operator | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | cluster-version-operator | current | 4.7.0-0.ci-2021-04-07-120112 | 1.617966e+09 | 1617797321 |
| 2 | cluster_version | a4c5e284-11dd-4b9c-af67-a4776665f9df | 4.6.23 | registry.build02.ci.openshift.org/ci-op-lnd3sj... | cluster-version-operator | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | cluster-version-operator | updating | 4.7.0-0.ci-2021-04-07-120112 | 1.617966e+09 | 1617963108 |
| 3 | cluster_version | a4c5e284-11dd-4b9c-af67-a4776665f9df | NaN | registry.build02.ci.openshift.org/ci-op-lnd3sj... | cluster-version-operator | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | cluster-version-operator | completed | 4.6.23 | 1.617966e+09 | 1617962883 |
| 4 | cluster_version | a4c5e284-11dd-4b9c-af67-a4776665f9df | NaN | registry.build02.ci.openshift.org/ci-op-lnd3sj... | cluster-version-operator | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | cluster-version-operator | initial | 4.6.23 | 1.617966e+09 | 1617960923 |
Metric = cluster_version_available_updates
| | __name__ | _id | job | namespace | pod | service | upstream | timestamp | value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | cluster_version_available_updates | a4c5e284-11dd-4b9c-af67-a4776665f9df | cluster-version-operator | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | cluster-version-operator | https://api.openshift.com/api/upgrades_info/v1... | 1.617966e+09 | 0 |
Metric = cluster_version_payload
| | __name__ | _id | job | namespace | pod | service | type | version | timestamp | value |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cluster_version_payload | a4c5e284-11dd-4b9c-af67-a4776665f9df | cluster-version-operator | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | cluster-version-operator | applied | 4.7.0-0.ci-2021-04-07-120112 | 1.617966e+09 | 403 |
| 1 | cluster_version_payload | a4c5e284-11dd-4b9c-af67-a4776665f9df | cluster-version-operator | openshift-cluster-version | cluster-version-operator-7f6578f6df-9zkr2 | cluster-version-operator | pending | 4.7.0-0.ci-2021-04-07-120112 | 1.617966e+09 | 265 |
Metric = code:apiserver_request_total:rate:sum
| | __name__ | _id | code | timestamp | value |
|---|---|---|---|---|---|
| 0 | code:apiserver_request_total:rate:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 0 | 1.617966e+09 | 17.963446730217427 |
| 1 | code:apiserver_request_total:rate:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 200 | 1.617966e+09 | 88.57392683079763 |
| 2 | code:apiserver_request_total:rate:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 201 | 1.617966e+09 | 9.1977697122807 |
| 3 | code:apiserver_request_total:rate:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 400 | 1.617966e+09 | 0 |
| 4 | code:apiserver_request_total:rate:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 404 | 1.617966e+09 | 12.106612631578946 |
Metric = count:up0
| | __name__ | _id | container | job | namespace | service | timestamp | value |
|---|---|---|---|---|---|---|---|---|
| 0 | count:up0 | a4c5e284-11dd-4b9c-af67-a4776665f9df | kube-rbac-proxy | machine-api-operator | openshift-machine-api | machine-api-operator | 1.617966e+09 | 1 |
Metric = count:up1
| | __name__ | _id | apiserver | job | namespace | service | timestamp | value | container | metrics_path |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | count:up1 | a4c5e284-11dd-4b9c-af67-a4776665f9df | kube-apiserver | apiserver | default | kubernetes | 1.617966e+09 | 3 | NaN | NaN |
| 1 | count:up1 | a4c5e284-11dd-4b9c-af67-a4776665f9df | openshift-apiserver | api | openshift-apiserver | api | 1.617966e+09 | 2 | openshift-apiserver | NaN |
| 2 | count:up1 | a4c5e284-11dd-4b9c-af67-a4776665f9df | NaN | alertmanager-main | openshift-monitoring | alertmanager-main | 1.617966e+09 | 3 | alertmanager-proxy | NaN |
| 3 | count:up1 | a4c5e284-11dd-4b9c-af67-a4776665f9df | NaN | catalog-operator-metrics | openshift-operator-lifecycle-manager | catalog-operator-metrics | 1.617966e+09 | 1 | catalog-operator | NaN |
| 4 | count:up1 | a4c5e284-11dd-4b9c-af67-a4776665f9df | NaN | image-registry-operator | openshift-image-registry | image-registry-operator | 1.617966e+09 | 2 | cluster-image-registry-operator | NaN |
Metric = csv_succeeded
| | __name__ | _id | container | exported_namespace | job | name | namespace | pod | service | version | timestamp | value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | csv_succeeded | a4c5e284-11dd-4b9c-af67-a4776665f9df | olm-operator | openshift-operator-lifecycle-manager | olm-operator-metrics | packageserver | openshift-operator-lifecycle-manager | olm-operator-675c8c5455-vljrp | olm-operator-metrics | 0.17.0 | 1.617966e+09 | 1 |
Metric = id_install_type
| | __name__ | _id | install_type | timestamp | value |
|---|---|---|---|---|---|
| 0 | id_install_type | a4c5e284-11dd-4b9c-af67-a4776665f9df | ipi | 1.617966e+09 | 0 |
Metric = id_primary_host_type
| | __name__ | _id | host_type | timestamp | value |
|---|---|---|---|---|---|
| 0 | id_primary_host_type | a4c5e284-11dd-4b9c-af67-a4776665f9df | aws | 1.617966e+09 | 0 |
Metric = id_provider
| | __name__ | _id | provider | timestamp | value |
|---|---|---|---|---|---|
| 0 | id_provider | a4c5e284-11dd-4b9c-af67-a4776665f9df | AWS | 1.617966e+09 | 0 |
Metric = id_version
| | __name__ | _id | version | timestamp | value |
|---|---|---|---|---|---|
| 0 | id_version | a4c5e284-11dd-4b9c-af67-a4776665f9df | 4.7.0-0.ci-2021-04-07-120112 | 1.617966e+09 | 0 |
Metric = id_version:cluster_available
| | __name__ | _id | version | timestamp | value |
|---|---|---|---|---|---|
| 0 | id_version:cluster_available | a4c5e284-11dd-4b9c-af67-a4776665f9df | 4.7.0-0.ci-2021-04-07-120112 | 1.617966e+09 | 1 |
Metric = instance:etcd_object_counts:sum
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | instance:etcd_object_counts:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 13347 |
| 1 | instance:etcd_object_counts:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 13226 |
| 2 | instance:etcd_object_counts:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 13222 |
| 3 | instance:etcd_object_counts:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 535 |
| 4 | instance:etcd_object_counts:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 535 |
Metric = monitoring:container_memory_working_set_bytes:sum
| | __name__ | _id | namespace | timestamp | value |
|---|---|---|---|---|---|
| 0 | monitoring:container_memory_working_set_bytes:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | openshift-monitoring | 1.617966e+09 | 7885987840 |
Metric = monitoring:haproxy_server_http_responses_total:sum
| | __name__ | _id | exported_service | timestamp | value |
|---|---|---|---|---|---|
| 0 | monitoring:haproxy_server_http_responses_total... | a4c5e284-11dd-4b9c-af67-a4776665f9df | alertmanager-main | 1.617966e+09 | 0 |
| 1 | monitoring:haproxy_server_http_responses_total... | a4c5e284-11dd-4b9c-af67-a4776665f9df | grafana | 1.617966e+09 | 0 |
| 2 | monitoring:haproxy_server_http_responses_total... | a4c5e284-11dd-4b9c-af67-a4776665f9df | prometheus-k8s | 1.617966e+09 | 0 |
Metric = node_role_os_version_machine:cpu_capacity_cores:sum
| | __name__ | _id | label_kubernetes_io_arch | label_node_hyperthread_enabled | label_node_openshift_io_os_id | label_node_role_kubernetes_io_master | timestamp | value |
|---|---|---|---|---|---|---|---|---|
| 0 | node_role_os_version_machine:cpu_capacity_core... | a4c5e284-11dd-4b9c-af67-a4776665f9df | amd64 | true | rhcos | true | 1.617966e+09 | 6 |
| 1 | node_role_os_version_machine:cpu_capacity_core... | a4c5e284-11dd-4b9c-af67-a4776665f9df | amd64 | true | rhcos | NaN | 1.617966e+09 | 6 |
Metric = node_role_os_version_machine:cpu_capacity_sockets:sum
| | __name__ | _id | label_kubernetes_io_arch | label_node_hyperthread_enabled | label_node_openshift_io_os_id | label_node_role_kubernetes_io_master | timestamp | value |
|---|---|---|---|---|---|---|---|---|
| 0 | node_role_os_version_machine:cpu_capacity_sock... | a4c5e284-11dd-4b9c-af67-a4776665f9df | amd64 | true | rhcos | true | 1.617966e+09 | 3 |
| 1 | node_role_os_version_machine:cpu_capacity_sock... | a4c5e284-11dd-4b9c-af67-a4776665f9df | amd64 | true | rhcos | NaN | 1.617966e+09 | 3 |
Metric = openshift:cpu_usage_cores:sum
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | openshift:cpu_usage_cores:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 8.710434819619936 |
Metric = openshift:memory_usage_bytes:sum
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | openshift:memory_usage_bytes:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 34930552832 |
Metric = openshift:prometheus_tsdb_head_samples_appended_total:sum
| | __name__ | _id | job | namespace | timestamp | value |
|---|---|---|---|---|---|---|
| 0 | openshift:prometheus_tsdb_head_samples_appende... | a4c5e284-11dd-4b9c-af67-a4776665f9df | prometheus-k8s | openshift-monitoring | 1.617966e+09 | 19098.033333333333 |
Metric = openshift:prometheus_tsdb_head_series:sum
| | __name__ | _id | job | namespace | timestamp | value |
|---|---|---|---|---|---|---|
| 0 | openshift:prometheus_tsdb_head_series:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | prometheus-k8s | openshift-monitoring | 1.617966e+09 | 1133366 |
Metric = workload:cpu_usage_cores:sum
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | workload:cpu_usage_cores:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 0.20604137085624644 |
Metric = workload:memory_usage_bytes:sum
| | __name__ | _id | timestamp | value |
|---|---|---|---|---|
| 0 | workload:memory_usage_bytes:sum | a4c5e284-11dd-4b9c-af67-a4776665f9df | 1.617966e+09 | 780345344 |
# peek into the combined data df
with pd.option_context("display.max_columns", 50):
    display(all_metrics_df.head())
| | __name__ | _id | alertname | alertstate | severity | timestamp | value | container | job | mode | namespace | pod | service | apiserver | label_beta_kubernetes_io_instance_type | label_kubernetes_io_arch | label_node_openshift_io_os_id | label_node_role_kubernetes_io | plugin_name | volume_mode | provisioner | networks | resource | type | region | invoker | version | condition | name | reason | from_version | image | upstream | code | metrics_path | exported_namespace | install_type | host_type | provider | exported_service | label_node_hyperthread_enabled | label_node_role_kubernetes_io_master |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | alerts | a4c5e284-11dd-4b9c-af67-a4776665f9df | AlertmanagerReceiversNotConfigured | firing | warning | 1.617966e+09 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | alerts | a4c5e284-11dd-4b9c-af67-a4776665f9df | Watchdog | firing | none | 1.617966e+09 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | cco_credentials_mode | a4c5e284-11dd-4b9c-af67-a4776665f9df | NaN | NaN | NaN | 1.617966e+09 | 1 | kube-rbac-proxy | cco-metrics | mint | openshift-cloud-credential-operator | cloud-credential-operator-578dd486f4-mnb2j | cco-metrics | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | cluster:apiserver_current_inflight_requests:su... | a4c5e284-11dd-4b9c-af67-a4776665f9df | NaN | NaN | NaN | 1.617966e+09 | 18 | NaN | NaN | NaN | NaN | NaN | NaN | kube-apiserver | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | cluster:apiserver_current_inflight_requests:su... | a4c5e284-11dd-4b9c-af67-a4776665f9df | NaN | NaN | NaN | 1.617966e+09 | 3 | NaN | NaN | NaN | NaN | NaN | NaN | openshift-apiserver | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
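The many NaN cells in the combined frame follow from how the per-metric frames are stacked: each metric carries its own set of labels, and an outer-join concatenation keeps the union of all columns. A minimal sketch with two made-up frames (the label values are illustrative, not real telemetry):

```python
import pandas as pd

# Two toy per-metric frames with different label columns (illustrative values).
alerts_df = pd.DataFrame(
    {"__name__": ["alerts"], "alertname": ["Watchdog"], "value": [1.0]}
)
cpu_df = pd.DataFrame(
    {"__name__": ["openshift:cpu_usage_cores:sum"], "value": [8.7]}
)

# An outer join keeps the union of columns; labels a metric lacks become NaN.
combined = pd.concat([alerts_df, cpu_df], axis=0, join="outer", ignore_index=True)
print(combined)
```

Here the CPU usage row has no `alertname` label, so that cell is filled with NaN, which is the same pattern seen in the wide table above.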
Get Data for Multiple Builds for a Given Job
In this section, we will fetch all the telemetry metrics, across all timestamps, for the 10 most recent builds of a given job. This data can help us understand how the behavior of the available metrics changes over time and across builds.
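For orientation, a Prometheus range query returns one entry per time series, each holding the series labels under `metric` and a list of `[timestamp, value]` pairs under `values`, with sample values encoded as strings. The sketch below flattens such a result into one row per sample using only the standard library. The payload and the helper name `flatten_range_result` are made up for illustration, and `MetricRangeDataFrame` may differ in its details.

```python
def flatten_range_result(result):
    """Turn a Prometheus range-query result into one dict per sample."""
    rows = []
    for series in result:
        labels = series["metric"]
        for timestamp, value in series["values"]:
            # Sample values arrive as strings, so convert them to float.
            rows.append({**labels, "timestamp": timestamp, "value": float(value)})
    return rows


# Hypothetical payload shaped like a range-query result.
sample_result = [
    {
        "metric": {"__name__": "workload:cpu_usage_cores:sum", "_id": "cluster-a"},
        "values": [[1617966000, "0.20"], [1617966300, "0.25"]],
    }
]

rows = flatten_range_result(sample_result)
print(rows[0])
```

This layout is also why the cluster id can be read from the `metric` labels of the first returned series.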
# fetch data from this number of builds for this job
NBUILDS = 10
# number of previous days of data to search to get the last n builds data for this job
NDAYS = 2
# max runtime of a build
# NOTE: this is an (over)estimate derived from SME conversations, as well as run durations from testgrid
MAX_DURATION_HRS = 12
# get invoker details
prev_ndays_invokers = MetricRangeDataFrame(
    pc.custom_query_range(
        query=f'max by (_id, invoker) (cluster_installer{{invoker=~"^openshift-internal-ci/{job_name}.*"}})',
        end_time=query_eval_time,
        start_time=query_eval_time - dt.timedelta(days=NDAYS),
        step="5m",
    )
).sort_index()
# split invoker name into prefix, job name, build id.
prev_ndays_invokers[["prefix", "job_name", "build_id"]] = prev_ndays_invokers[
    "invoker"
].str.split("/", expand=True)
# drop now redundant columns.
prev_ndays_invokers.drop(columns=["invoker", "prefix", "value"], inplace=True)
# drop irrelevant columns.
prev_ndays_invokers.drop(columns=drop_cols, errors="ignore", inplace=True)
prev_ndays_invokers.head()
| timestamp | _id | job_name | build_id |
|---|---|---|---|
| 1617793200 | 70f9c6f3-eb6e-40e7-8b95-a233bba63e84 | periodic-ci-openshift-release-master-ci-4.7-up... | 1379730553359568896 |
| 1617793200 | 141ef74c-98e2-48cd-8364-fb338e3e1e37 | periodic-ci-openshift-release-master-ci-4.7-up... | 1379732995325300736 |
| 1617793500 | 70f9c6f3-eb6e-40e7-8b95-a233bba63e84 | periodic-ci-openshift-release-master-ci-4.7-up... | 1379730553359568896 |
| 1617793500 | 141ef74c-98e2-48cd-8364-fb338e3e1e37 | periodic-ci-openshift-release-master-ci-4.7-up... | 1379732995325300736 |
| 1617793800 | 141ef74c-98e2-48cd-8364-fb338e3e1e37 | periodic-ci-openshift-release-master-ci-4.7-up... | 1379732995325300736 |
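Because the frame above is sorted by timestamp in ascending order, taking the first N unique build ids from it would yield the earliest builds in the search window. If the most recently started builds are wanted instead, one option is to record when each build id first appears and keep the latest ones. A standard-library sketch, assuming invoker strings of the form `prefix/job_name/build_id`; the helper name `most_recent_builds` and the sample values are hypothetical:

```python
def most_recent_builds(samples, n):
    """Return up to n build ids, ordered from most to least recently first seen.

    `samples` is an iterable of (timestamp, invoker) pairs, where invoker is
    assumed to look like "prefix/job_name/build_id".
    """
    first_seen = {}
    for timestamp, invoker in samples:
        # Split into exactly three parts: prefix, job name, build id.
        _prefix, _job_name, build_id = invoker.split("/", 2)
        if build_id not in first_seen or timestamp < first_seen[build_id]:
            first_seen[build_id] = timestamp
    ordered = sorted(first_seen, key=first_seen.get, reverse=True)
    return ordered[:n]


samples = [
    (1617793200, "openshift-internal-ci/some-job/build-1"),
    (1617793500, "openshift-internal-ci/some-job/build-1"),
    (1617796800, "openshift-internal-ci/some-job/build-2"),
    (1617800400, "openshift-internal-ci/some-job/build-3"),
]
recent = most_recent_builds(samples, n=2)
print(recent)
```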
# for each build, get cluster id and then the corresponding metrics from all timestamps
all_metrics_df = pd.DataFrame()
for build_id in tqdm(prev_ndays_invokers["build_id"].unique()[:NBUILDS]):
    job_build_cluster_installer = pc.custom_query_range(
        query=f'cluster_installer{{invoker="openshift-internal-ci/{job_name}/{build_id}"}}',
        end_time=query_eval_time,
        # pad the search window by the max build runtime (in hours)
        start_time=query_eval_time
        - dt.timedelta(days=NDAYS)
        - dt.timedelta(hours=MAX_DURATION_HRS),
        step="5m",
    )
    # extract cluster id out of the installer info metric
    cluster_id = job_build_cluster_installer[0]["metric"]["_id"]
    # get all telemetry time series
    for metric in metrics_to_fetch:
        # fetch the metric
        metric_result = pc.custom_query_range(
            query=f'{metric}{{_id="{cluster_id}"}}',
            end_time=query_eval_time,
            start_time=query_eval_time
            - dt.timedelta(days=NDAYS)
            - dt.timedelta(hours=MAX_DURATION_HRS),
            step="5m",
        )
        if len(metric_result) > 0:
            metric_df = MetricRangeDataFrame(metric_result).reset_index(drop=False)
            # drop irrelevant cols, if any
            metric_df.drop(columns=drop_cols, errors="ignore", inplace=True)
            # combine all the metrics data.
            all_metrics_df = pd.concat(
                [
                    all_metrics_df,
                    metric_df,
                ],
                axis=0,
                join="outer",
                ignore_index=True,
            )
# convert values to float once all builds have been collected
all_metrics_df["value"] = all_metrics_df["value"].astype(float)
# visualize time series behavior across builds
for metric in all_metrics_df["__name__"].unique():
    plt.figure(figsize=(15, 5))
    metric_df = all_metrics_df[all_metrics_df["__name__"] == metric][
        ["_id", "timestamp", "value"]
    ]
    metric_df.set_index("timestamp").groupby("_id").value.plot(legend=True)
    plt.xlabel("timestamp")
    plt.ylabel("value")
    plt.legend(loc="best")
    plt.title(metric)
    plt.show()
# save the metrics as a static dataset to use in future
save_to_disk(
    all_metrics_df,
    "../../../data/raw/",
    f"telemetry-{query_eval_time.year}-{query_eval_time.month}-{query_eval_time.day}.parquet",
)
True
Conclusion
In this notebook, we have:
- Collected all telemetry data corresponding to a given job and build.
- Understood how to interpret Prometheus data using an example metric.
- Collected all telemetry data from all timestamps for the top 10 most recent builds for a given job.
- Visualized what the general time series behavior of metrics looks like across builds.
- Saved the above data for further analysis.
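
As one possible next step for the saved data, the per-build curves can be made easier to compare by expressing time relative to each cluster's first sample rather than as absolute timestamps. A minimal pandas sketch on made-up data (the column names follow the combined frame used in this notebook; the values, the cluster ids, and the `elapsed_s` column name are illustrative assumptions):

```python
import pandas as pd

# Toy long-format data: two clusters sampled at different absolute times.
df = pd.DataFrame(
    {
        "_id": ["cluster-a", "cluster-a", "cluster-b", "cluster-b"],
        "timestamp": [1000.0, 1300.0, 5000.0, 5300.0],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

# Seconds elapsed since the first sample of each cluster.
first_sample = df.groupby("_id")["timestamp"].transform("min")
df["elapsed_s"] = df["timestamp"] - first_sample
print(df)
```

Plotting `value` against `elapsed_s` instead of `timestamp` would overlay builds on a common time axis, at the cost of hiding when each build actually ran.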
